Skip to content

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442

Open
samuelburnham wants to merge 17 commits into
kernel-riscvfrom
sb/kernel-perf
Open

perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442
samuelburnham wants to merge 17 commits into
kernel-riscvfrom
sb/kernel-perf

Conversation

@samuelburnham

Copy link
Copy Markdown
Member

This pull request implements Rust-kernel and ixon optimizations for the Zisk guest, measured against the base #411 as deterministic guest cycle counts (ziskemu -m) over dumped shard inputs (zisk/scripts/bench-cycles.sh), with every counted run's committed failures publics word verified zero.

Key Results

  • Int16.instRxcHasSize_eq: 5.70 B → 42.0 M guest steps (135.6×); Int32/Int64 variants confirmed at the same magnitude (~42 M) — in-circuit cost is now integer-width-independent
  • Cycle suite total: 17.53 B → 5.77 B steps (3.04×, −67.1%)
  • Environment-machine WHNF (final two kernel commits): −19.3% suite net; Vector.extract_append._proof_1 −32.4%; the Init Array/Vector Extract proof family −22% to −33% each (five multi-billion-step constants)
  • Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual (28.35 B steps, the most expensive constant measured in Init+Std): previously crashed the 512 MB guest (OOM-class null read) → now completes as a valid kernel pass — lazy closures cut peak RAM, not just cycles
  • Kernel hardened against adversarial inputs: term identity moved off per-node content hashing with the threat model documented (docs/kernel_identity.md), blob bytes verified at load in all deserializers
input base (#411) this PR Δ speedup
_private.Init.Data.Range.Polymorphic.SInt.0.Int16.instRxcHasSize_eq 5,697,166,200 42,012,168 −99.3% 135.6×
_private.Init.Data.Vector.Extract.0.Vector.extract_append._proof_1 7,690,887,211 3,344,459,375 −56.5% 2.3×
List.mergesort synthetic env 4,005,320,347 2,290,310,593 −42.8% 1.7×
Nat.add_comm 18,690,815 11,532,048 −38.3% 1.6×
Array.binSearchAux._unary 95,800,411 62,862,965 −34.4% 1.5×
Nat.gcd_comm 6,669,821 5,550,752 −16.8% 1.2×
Batteries.RBMap 13,940,412 12,321,092 −11.6% 1.1×
String.append 4,560,304 4,070,450 −10.7% 1.1×
total 17,533,035,521 5,773,119,443 −67.1% 3.04×

Fifteen Commits (oldest first)

  1. bench: guest cycle harness + native single-constant check

    • bench-cycles.sh suite over dumped guest stdins with failures-word validation; check_one example mirrors the guest's reuse-mode check natively
  2. kernel: keep symbolic Nat offsets compact (whnf stuck + offset def-eq)

    • Nat.add x lit / Nat.div|mod x k stay stuck as compact offsets instead of materializing succ^n(x) chains or the division algorithm; bulk offset def-eq decides offset pairs directly
    • The Int16.instRxcHasSize_eq collapse (5.70 B → 56 M) and the UTF-8-codec-class fix
  3. kernel: intern-assigned uids replace per-node blake3 content hashing

    • Term identity becomes intern-table-assigned sequential u64s; shallow structural keys for hash-consing; blake3 stays only at the Ixon boundary (content addresses, Merkle roots, proof-carrying-code claims)
    • The broadest single win: −30–36% on every reduction-heavy input
  4. kernel: memoized prim-family dispatch in the WHNF/def-eq reduction loops

    • Constant heads classified once per address (Native/BitVec/Nat/Decidable/Str); at most one family recognizer runs per iteration instead of the five-probe gauntlet
  5. ixon: defer per-constant address verification to first materialization

    • Constants verify on first get(); blob bytes still verified eagerly at load
  6. bench: add Int32/Int64 instRxcHasSize_eq to the cycle suite

  7. docs: kernel uid identity vs Ixon content addressing

    • Two-layer identity model and why uid collisions are not an attack surface (uids are assigned, never computed from input)
  8. kernel: fix PrimFamily visibility/qualification warnings

  9. kernel: port jcb/fixes H-15 whnf probe pre-filter + H-12 nat output caps

    • Allocation-free transient-nat spine probe ahead of the cache lookup; caps on Nat primitive output sizes
  10. kernel+ixon: harden term identity against adversarial inputs

    • Blob bytes verified against their address at load in all three deserializers; structural equality audited against the adversarial model
  11. kernel: privatize uid-accepting constructors; document cross-shard uid story

  12. docs: design for the environment-machine WHNF port

    • Full design for the Krivine-style machine (ported from the proven IxVM implementation), including the exit-to-outer-loop contract and the known regression class
  13. kernel: environment-machine WHNF (Phase A — lazy substitution on the beta path)

    • whnf_core's App arm enters a closure machine when a beta fires: beta/zeta are O(1) environment pushes; substitution materializes only at machine exits (clo_subst readback), so work is proportional to what the reduction consumes
    • Every exit re-enters the existing loop: ambient zeta, iota, prim dispatch, cheap-mode def-eq flags, and all caches stay byte-identical
  14. kernel: closure-iota at the machine's recursor exit (Phase B)

    • Recursor spines consumed lazily: only the major premise materializes; params/motives/minors and post-major args ride through as closures, so unselected minors (dropped match/Decidable branches) are never substituted and never read back
    • K recursors, literal majors, and struct-eta deliberately miss to the unchanged plain path, preserving the Nat transient-work and linear-rec/offset shortcuts exactly
  15. docs: env-machine WHNF design is implemented (Phases A+B)

zisk/scripts/bench-cycles.sh runs the standard dumped-input suite
(~/benchdata/zisk-inputs) through ziskemu, records steps, and verifies
the committed `failures` publics word is zero. check_one mirrors the
Zisk guest's one-work-item path natively so IX_* instrumentation can
target a single expensive constant.
Port of IxVM dbc4177/5dcab7f/f7cfe23 to the Rust kernel, three
coordinated changes:

- whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (symbolic
  base, n>0) and `Nat.div/mod base (Lit k)` (k>=2) STUCK in compact form
  instead of delta-unfolding into succ^n towers / the division
  algorithm. Iota over such majors already works via
  cleanup_nat_offset_major.
- linear-rec: collapse `succ^o(Nat.rec base succ-step (Lit n))` with a
  SYMBOLIC base to `Nat.add base (Lit n+o)` instead of declining into n
  iota steps.
- def-eq (try_def_eq_offset): decompose both sides to base + offset
  (literals, succ layers, add-lit read in O(1)) and strip the shared
  offset in ONE step. The previous one-succ-peel recursed through full
  is_def_eq per layer, which blew MAX_DEF_EQ_DEPTH=2000 on deep offsets.

Guest cycles (ziskemu, vs d2ea3d9 baseline):
  Int16.instRxcHasSize_eq   5,697,166,200 -> 56,028,269  (-99.0%)
  binsearch                    95,800,411 -> 87,735,518  (-8.4%)
  Nat.add_comm                 18,690,815 -> 17,805,210  (-4.7%)
  mergesort shard           4,005,320,347 -> 3,920,521,476 (-2.1%)
  Vector.extract_append._proof_1 unchanged (+0.02%), rest ~0.

604 kernel tests pass (one test updated: whnf of `x + 2` now asserts
the compact stuck form; new test covers tower-vs-compact def-eq).
Every KExpr/KUniv construction used to compute a blake3 hash of its
content for identity (intern key, cache keys, equality) — ~20% of guest
cycles on reduction-heavy constants (app_hash 22-33% cumulative).
Identity is now an intern-assigned u64 uid:

- uids come from a process-global atomic counter and are NEVER reused,
  so uid equality implies structural equality and cache keys cannot
  alias across intern-table clears (a stale key can only miss).
- InternTable keys on shallow structural keys (ExprKey/UnivKey): variant
  tag + child uids + semantic payload, mirroring exactly what the old
  content hash covered (display names, binder info, mdata excluded).
  intern_expr/intern_univ canonicalize children recursively when handed
  nodes built outside the table, preserving the old "structural identity"
  interning semantics.
- PartialEq for KExpr/KUniv is now structural with the uid fast path
  (canonical-vs-canonical compares O(1)); hash_eq remains the
  fast-but-incomplete uid check for cache/quick-path callers where a
  false negative only costs a fallback.
- Cache ctx components stay blake3 (CtxAddr) but hash uids instead of
  content addrs; push_local/push_let intern their frame types so equal
  context suffixes share cache entries.
- ingress's hash-first intern probe becomes key-first (same skip-the-
  construction win, no hash).

Soundness note: affirmative identity (equal uid => equal term) is what
def-eq quick paths and caches rely on, and it holds by construction;
incomplete negatives only cost cache misses. Nat/Str literals keep their
blake3 blob Addresses (ingress-meaningful).

Guest cycles (ziskemu, vs previous commit):
  mergesort shard      3,920,521,476 -> 2,526,226,208  (-35.6%)
  Vector.extract_..._1 7,692,517,694 -> 5,369,879,578  (-30.2%)
  Nat.add_comm            17,805,210 ->    12,687,822  (-28.7%)
  binsearch                87,735,518 ->    64,562,040  (-26.4%)
  Int16.instRxc..._eq      56,028,269 ->    45,024,529  (-19.6%)
  natgcdcomm/stringappend/rbmap: -11.2% / -6.9% / -2.7%

604 kernel tests pass (16 hash-determinism tests updated to assert the
preserved property — identity ignores names/bi/mdata — via structural
equality, shallow-key equality, or shared-table interning).
The WHNF loops (full + no-delta) and the def-eq lazy-delta loop probed
all five primitive recognizers (native/bitvec/nat/decidable/string)
every iteration; each collects its own app spine and runs its own
gauntlet of 32-byte address compares — pure tax on ordinary
constant-headed terms. Classify the head once per iteration via an
allocation-free app-chain walk + per-address memo
(KEnv::prim_family_cache), and call at most the one matching family
reducer. Family head-address sets are disjoint, so dispatch order
semantics are unchanged. The nat-offset-stuck probe gates on the Nat
family too. Mirrors the IxVM prim_family dispatch memo (ix-aiur,
measured there as part of -28% on Vector._proof_2).

Guest cycles (ziskemu, vs previous commit):
  mergesort shard      2,526,226,208 -> 2,399,597,758  (-5.0%)
  Vector.extract_..._1 5,369,879,578 -> 5,112,313,900  (-4.8%)
  Nat.add_comm            12,687,822 ->    12,249,578  (-3.5%)
  Int16.instRxc..._eq      45,024,529 ->    43,835,207  (-2.6%)
  binsearch                64,562,040 ->    64,095,836  (-0.7%)
  others ~0 (+25 steps cache-population noise on two inputs)

604 kernel tests pass.
Env::get_anon paid one blake3 content-hash per constant at parse time.
The binding check moves into LazyConstant::get() (pending_addr field):
constants shipped in a closure but never forced by the typechecker are
never hashed, while everything the kernel certifies is still verified
on materialization — checking a constant forces it. The compile-side
from_constant path (pre-populated cache) is exempt as before.

Guest cycles: rbmap -9.5%, natgcdcomm -6.4%, stringappend -4.2%,
Int16 -3.7%, binsearch -1.6%, nataddcomm -0.3%; mergesort/vector
+0.01% (re-verification of the few constants materialized more than
once). 200 ixon + 604 kernel tests pass (tampering test updated to
assert rejection at get() instead of at parse).
Both complete at ~42M steps, matching Int16 — the nat-offset compaction
makes the cost range-independent (pre-fix, Int16's 2^16 range alone was
5.7B steps and Int32/Int64 would not have completed).
Records the two-identity-layer boundary after 1e3029d: constants/blobs
keep their blake3 Ixon Addresses (claims, Merkle roots, proof store,
cross-run reuse, and per-constant inclusion proofs are all unchanged),
while the removed blake3 hashing was the kernel-internal per-node
scheme used only for intern/cache identity inside one checker run —
never serialized, never part of a proof artifact. Documents the
soundness argument (equal uid => equal term, never reused), the
boundary rule (uids must not escape into serialized artifacts), and
what a future sub-constant commitment feature would need to recompute
at the egress boundary.
PrimFamily is referenced by the public KEnv::prim_family_cache field, so
it must be pub; drop a redundant path qualification.
H-15 (perf): the transient-nat probes run on every whnf/whnf_core call
BEFORE the cache lookup; they paid two collect_app_spine Vec allocations
plus a KConst::Recr clone per call on the overwhelmingly common
non-Nat-recursor path. Pre-filter with an allocation-free
spine_head_and_len; spine collection + Recr lookup now happen only for
actual Nat-recursor heads. Also drop the redundant rules.clone() in the
iota IotaInfo build (rules is already owned from the try_get_const
clone).

H-12 (hardening): output-size cap (2^26 bits) on natively computed
shiftLeft/pow results — a ~12-byte adversarial term could otherwise
demand a multi-GiB allocation (cycle explosion in-guest). Oversized
results stay stuck, like the existing pow exponent guard.

Ported from jcb/fixes 3763356 (pre-workspace layout), adapted to
kernel-riscv.

Guest cycles (ziskemu, vs previous commit):
  mergesort shard      2,399,808,204 -> 2,313,445,541  (-3.6%)
  Vector.extract_..._1 5,111,066,294 -> 4,949,833,244  (-3.2%)
  Nat.add_comm            12,217,888 ->    11,798,013  (-3.4%)
  Int16/32/64 rxc          ~42.2M    ->    ~41.3M      (-2.1%)
  binsearch                63,115,445 ->    62,640,316  (-0.75%)
  small inputs ~0. 604 kernel tests pass.
Closes the adversarial-binding review of the uid-interning design:

- ixon: blob bytes are blake3-verified against their address at load in
  all three env deserializers (ports jcb/fixes H-1). This is the one
  place adversarial CONTENT enters kernel identity: ExprKey::Nat/Str and
  literal structural equality key on the blob Address (mirroring the old
  content hash), which is sound only if address binds bytes. Costs
  +0.06..+0.29% guest cycles (blobs are a small fraction of env bytes;
  constants stay lazily verified).
- kernel: fresh_uid aborts on counter exhaustion instead of wrapping
  (>2^67 guest cycles to reach; identity soundness should not rest on
  "probably won't happen").
- kernel: literal structural-equality arms compare values as well as
  blob addresses (defense in depth — redundant under the load-time
  check, degrades to inequality if that invariant were ever violated).
- kernel: intern hits assert structural equality in debug builds.
- docs/kernel_identity.md: adversarial-model section — why uid identity
  cannot be cache-poisoned (uids are assigned by a sequential counter,
  never computed from content, so there is no hash function to collide;
  intern keys are exact structural Eq, never digests; cache hits require
  uid equality; map-bucket collisions cost a key compare, not a wrong
  hit).

604 kernel + 200 ixon tests pass.
…d story

The *_mdata_with_addr constructors were the only API through which a
caller-supplied (hence potentially duplicated) uid could enter a node;
nothing outside expr.rs uses them since the ingress key-first rework, so
make them private — uid uniqueness per process is now enforced by
visibility, not convention. Docs: spell out the within-shard (impossible
by assignment) vs between-shards (certain and harmless; uids never cross
the process boundary) collision analysis.
Krivine-style machine for the structural WHNF fragment, ported from
the proven IxVM implementation (ix-aiur d0d3375 Phase A hybrid entry,
b4c4ea3 Phase B closure-iota; measured there Int16 -33.8%, Vector
-17.1%, mergeSort -6.8%). Covers the Rust-specific mapping the IxVM
side doesn't have: cheap_proj/cheap_rec threading, fuel retention
(adversarial posture), ambient let/LDecl zeta via the exit-to-outer-
loop contract, iterative loop for stack discipline, clo_subst readback
with lbr guards + per-call scratch, unchanged cache/uid semantics.
Phased A (beta/zeta) then B (closure-iota, the UTF-8-class payoff),
with the known curried-sharing regression class added to the bench
suite before implementation.
…beta path)

Port of the IxVM environment machine (~/ix-aiur d0d3375; design in
docs/env_machine_whnf.md). whnf_core's App arm enters a Krivine-style
machine when a beta fires: beta/zeta become O(1) environment pushes
onto a persistent cons-list of closures, and substitution materializes
only at machine exits (clo_subst readback) — work proportional to what
the reduction consumes, not to every beta performed. A beta chain
ending in another beta never materializes its intermediate bodies.

- subst.rs: Clo/MEnv closure types (carried env length; per-Clo
  OnceCell readback memo standing in for Aiur's global memoization),
  clo_subst/clo_readback with lbr guards, per-call (uid, depth) memo
  from a new InternTable scratch pool, results interned like subst.
- whnf.rs: iterative machine_whnf loop + machine_exit, replacing the
  multi-lambda peel + simul_subst block. Fuel charged on beta/zeta at
  the eager path's one-substitution-event granularity. Every exit
  reads back and re-enters the outer loop, so ambient let/LDecl zeta,
  iota, prim dispatch, and cheap_proj/cheap_rec def-eq discipline are
  unchanged; all caches and keys untouched.
- bench-cycles.sh: add natreclinear/natfoldsucc (foldAdd-class
  sentinels for the known curried-sharing regression class).

609 kernel + 200 ixon tests pass; native check_one true on
Nat.add_comm, Nat.gcd_comm, Int16.instRxcHasSize_eq, natRecLinearCheck,
Nat.fold_succ; perf_check whole-env failure parity with HEAD on 11
envs. Guest cycle benchmarks pending (box busy).
Port of IxVM try_iota_c (~/ix-aiur 097e9d5). The machine's Const exit
no longer reads its spine back when the head is a recursor on the main
ctor-rule path: only the major premise materializes (readback + the
same full whnf the eager path uses), and on a hit the rule RHS
re-enters the machine with params/motives/minors and post-major
arguments as their ORIGINAL closures plus the ctor fields wrapped
closed. The rule's Lam-chain betas push them straight into an
environment, so unselected minors — dropped match/Decidable branches,
the UTF-8 codec class — are never substituted and never read back.

Off-main-path cases miss to the plain readback exit by construction,
preserving semantics exactly: K recursors (ctor synthesis needs
infer + def-eq), Nat literal majors (transient-work cache discipline
and the linear-rec/offset shortcuts stay on the plain path), Str
literal majors, struct-eta candidates, and stuck majors. cheap_rec
mode skips machine-iota entirely, mirroring the eager gating. A miss
costs one readback of the major; the plain path's whnf of it then
hits a warm cache.

Machine-native delta (IxVM Phase C1.5) does NOT port at this layer:
their machine lives in a whnf that includes delta, while ours sits
inside whnf_core, which must stay delta-free for def-eq's lazy-delta
unfold ordering. Spanning the whnf delta loop with closures is a
separate, larger restructure.

610 kernel + 200 ixon tests pass (new closure-iota-via-beta unit test
covers the spine arithmetic incl. post-major shift); native check_one
true on the usual five constants; perf_check whole-env failure parity
on 11 envs. Guest cycle benchmarks pending alongside Phase A's.
map_or_else over map().unwrap_or_else(), pass ExprKey by reference in
timed_intern_or_build, explicit clone for spec_params dedup snapshot.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant